When is the allele-sharing dissimilarity between two populations exceeded by the allele-sharing dissimilarity of a population with itself?

Abstract Allele-sharing statistics for a genetic locus measure the dissimilarity between two populations as a mean of the dissimilarity between random pairs of individuals, one from each population. Owing to within-population variation in genotype, allele-sharing dissimilarities can have the property that they have a nonzero value when computed between a population and itself. We consider the mathematical properties of allele-sharing dissimilarities in a pair of populations, treating the allele frequencies in the two populations parametrically. Examining two formulations of allele-sharing dissimilarity, we obtain the distributions of within-population and between-population dissimilarities for pairs of individuals. We then mathematically explore the scenarios in which, for certain allele-frequency distributions, the within-population dissimilarity – the mean dissimilarity between randomly chosen members of a population – can exceed the dissimilarity between two populations. Such scenarios assist in explaining observations in population-genetic data that members of a population can be empirically more genetically dissimilar from each other on average than they are from members of another population. For a population pair, however, the mathematical analysis finds that at least one of the two populations always possesses smaller within-population dissimilarity than the value of the between-population dissimilarity. We illustrate the mathematical results with an application to human population-genetic data.


Introduction
Statistics that measure the genetic dissimilarity between pairs of populations are widely used for interpreting population-genetic data (Bowcock et al. 1994; Chakraborty and Jin 1993; Gao and Martin 2009; Mountain and Cavalli-Sforza 1997; Mountain and Ramakrishnan 2005; Rosenberg 2011; Tal 2013; Witherspoon et al. 2007). Patterns in numerical values of the statistics appear in calculations of the relative similarity and dissimilarity of different human groups (Mountain and Ramakrishnan 2005; Rosenberg 2011; Witherspoon et al. 2007). Further, genetic dissimilarity statistics, often termed "genetic distances," underlie frequently applied tools for data analysis and visualization, including methods such as evolutionary tree construction (Bowcock et al. 1994) and multidimensional scaling (Gao and Martin 2009).
Population-level genetic dissimilarity statistics computed at a single genetic locus often proceed by considering pairs of vectors, $\mathbf{p}$ and $\mathbf{q}$, representing the allele frequencies of two populations. Each vector consists of nonnegative entries that sum to 1. Hence, for a locus with $I$ distinct alleles, such a genetic dissimilarity statistic has domain $\Delta^{I-1} \times \Delta^{I-1}$, where $\Delta^{I-1}$ is the simplex $\{(p_1, p_2, \ldots, p_I) : \sum_{i=1}^{I} p_i = 1 \text{ and } p_i \geq 0 \text{ for all } i\}$.
Among the many genetic dissimilarity statistics that are available (Jorde 1985; Nei 1987), those known as allele-sharing dissimilarities form a distinctive subset. Such statistics view a dissimilarity between two populations as the mean of a dissimilarity between pairs of individuals, one from one population and one from the other. With this perspective, they have a simple interpretation as a population-level generalization of an individual-level statistic. They also have a natural connection to a fundamental computation in human population genetics, the apportionment of genetic diversity among different levels of genetic structure (Edge et al. 2022; Lewontin 1972), which can be viewed in terms of various mean pairwise dissimilarities across certain subsets of individuals (Rosenberg 2011).
Most dissimilarity statistics, such as those based on the Euclidean distance between functions of allele frequency vectors (Cavalli-Sforza and Edwards 1967) or the dot product of these vectors (Nei 1972), equal zero when a population is compared with itself. Because allele-sharing dissimilarities emerge from inter-individual computations among non-identical individuals, they can instead produce nonzero values for the dissimilarity between a polymorphic population and itself. This feature assists in understanding a property of genetic variation in structured populations: the extent to which genetic dissimilarity of individuals from the same population ever exceeds genetic dissimilarity of individuals from different populations, if at all.
Because individuals in a population generally possess a larger number of recent shared ancestors than individuals from different populations, a perspective focused on population-genetic descent predicts that individuals from the same population will be genetically more similar than individuals from different populations. Indeed, in human population genetics, studies of allele-sharing dissimilarity find that the mean dissimilarity across pairs of individuals from different populations does exceed the mean dissimilarity for pairs from the same population (Mountain and Ramakrishnan 2005; Rosenberg 2011; Tal 2013; Witherspoon et al. 2007). However, such studies also find a perhaps unexpected result that the allele-sharing dissimilarity for some pairs of individuals from the same population can exceed the dissimilarity for some pairs from different populations.
Here, we seek to explain the properties of allele-sharing dissimilarities within and between populations. We study mathematical properties of population-level allele-sharing dissimilarities under the assumption that individuals in a population represent random draws from the vector of allele frequencies in the population. We consider mean allele-sharing dissimilarities for pairs of individuals from the same population and for pairs of individuals from different populations, evaluating the conditions on allele-frequency vectors under which the allele-sharing dissimilarity for a population to itself can exceed the allele-sharing dissimilarity between two populations. We interpret the results in relation to ongoing efforts to understand human genetic similarity and difference.

Allele-sharing dissimilarities
An allele-sharing dissimilarity (ASD) is a type of dissimilarity that is based on counting the number of alleles shared at a locus between two diploid individuals. We consider two different versions of the ASD concept.
In one ASD variant, which we denote by $\mathcal{D}_1$, "allele-sharing" for two diploid individuals is interpreted as the number of shared elements in their multisets of alleles. Consider a locus with four distinct alleles, the minimum number required so that all possible cases exist. Call these alleles A, B, C, and D. For $\mathcal{D}_1$, two individuals both with genotype AB have 2 alleles shared, as the sets {A, B} and {A, B} have 2 identical elements. An individual with genotype AB and an individual with genotype AC have 1 allele shared, as the sets {A, B} and {A, C} have 1 element shared between them, namely A. Two individuals with genotype AA have 2 alleles shared, as multisets {A, A} and {A, A} have 2 shared elements, A and A. The dissimilarity $\mathcal{D}_1$ then uses 1 minus half the number of the shared alleles as the dissimilarity; the normalization ensures that $\mathcal{D}_1$ lies in [0,1] (Gao and Martin 2009; Mountain and Cavalli-Sforza 1997). With 0, 1, and 2 shared alleles, the dissimilarity equals 1, 1/2, and 0, respectively.
Another variant of ASD, which we denote by $\mathcal{D}_2$, instead considers alleles individually, evaluating the fraction of pairs of alleles, one from the first individual and one from the second, that are distinct (Mountain and Ramakrishnan 2005). For two individuals with genotype AB, for example, two of the four allele pairs are distinct (A with B, and B with A), so that $\mathcal{D}_2 = 1/2$ even though $\mathcal{D}_1 = 0$.

Table 1: Two variants of allele-sharing dissimilarity. All possible pairs of unordered genotypes are shown, along with their values of $\mathcal{D}_1$ and $\mathcal{D}_2$.

| Case | Genotypes | $\mathcal{D}_1$ | $\mathcal{D}_2$ |
| 1 | AA, AA | 0 | 0 |
| 2 | AA, AB | 1/2 | 1/2 |
| 3 | AA, BB | 1 | 1 |
| 4 | AA, BC | 1 | 1 |
| 5 | AB, AB | 0 | 1/2 |
| 6 | AB, AC | 1/2 | 3/4 |
| 7 | AB, CD | 1 | 1 |

Table 1 shows all seven possible pairs of unordered diploid genotypes for two individuals and their corresponding dissimilarities measured by $\mathcal{D}_1$ and $\mathcal{D}_2$. In only two of seven cases do the two dissimilarities differ.
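The two definitions can be stated compactly in code. The following Python sketch is illustrative only: the function names `d1` and `d2` and the representation of a genotype as a two-character string are assumptions made here, not part of the original analysis.

```python
from collections import Counter

def d1(g, h):
    """D1: one minus half the size of the multiset intersection of two genotypes."""
    shared = sum((Counter(g) & Counter(h)).values())
    return 1 - shared / 2

def d2(g, h):
    """D2: fraction of the four cross-individual allele pairs that are distinct."""
    return sum(a != b for a in g for b in h) / 4

# The seven unordered genotype pairs that are possible with alleles A, B, C, D.
cases = [("AA", "AA"), ("AA", "AB"), ("AA", "BB"), ("AA", "BC"),
         ("AB", "AB"), ("AB", "AC"), ("AB", "CD")]

for g, h in cases:
    print(g, h, d1(g, h), d2(g, h))
```

Running the loop lists both dissimilarities for each genotype pair; the two values differ only for the pairs (AB, AB) and (AB, AC).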

Notation
Consider a locus with $I$ distinct alleles. We consider allele-frequency vectors in each of two populations. In Population 1, the allele frequencies are $\mathbf{p} = (p_1, p_2, \ldots, p_I)$, where $p_i$ represents the frequency of allele $i$. In Population 2, they are $\mathbf{q} = (q_1, q_2, \ldots, q_I)$. The frequencies satisfy $0 \leq p_i, q_i \leq 1$ for all $i$, and $\sum_{i=1}^{I} p_i = \sum_{i=1}^{I} q_i = 1$. We are interested in mathematical properties of the distribution of ASD measure $\mathcal{D}$ for pairs of populations (possibly the same population), where $\mathcal{D}$ can refer to $\mathcal{D}_1$ or $\mathcal{D}_2$. We denote the dissimilarity $\mathcal{D}$ between two randomly chosen individuals within the same population with allele-frequency vector $\mathbf{p}$ by $\mathcal{D}^w(\mathbf{p})$, and the corresponding dissimilarity between two randomly chosen individuals from different populations with allele-frequency vectors $\mathbf{p}$ and $\mathbf{q}$ by $\mathcal{D}^b(\mathbf{p}, \mathbf{q})$. We often drop the arguments for convenience.
We will have occasion to use various symmetric sums involving allele frequencies.For t = 1, 2, 3, 4, for expressions in the separate populations, we use the notation where  1 =  1 = 1.

Assumptions
We seek to perform ASD computations under the assumption that individuals are sampled at random from allele-frequency distributions. With this perspective, for a random pair of individuals, an ASD measure is a random variable that depends on the allele-frequency vectors of two populations of interest, treated as parameters. At a given locus, we assume that the two alleles of an individual are sampled independently, so that diploid genotypes in a population are assumed to follow Hardy-Weinberg proportions. In other words, the probabilities of diploid genotypes in a population with allele-frequency vector $\mathbf{p}$ equal $p_i^2$ for homozygous genotypes and $2 p_i p_j$ for heterozygous unordered genotypes, with $i \neq j$.
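As a small illustration of the Hardy-Weinberg assumption, the sketch below builds the distribution of unordered genotypes from an allele-frequency vector. The function name `hardy_weinberg` and the use of integer allele indices are choices made for this example.

```python
from itertools import combinations_with_replacement

def hardy_weinberg(p):
    """Map each unordered genotype (i, j), i <= j, to its Hardy-Weinberg probability."""
    probs = {}
    for i, j in combinations_with_replacement(range(len(p)), 2):
        # Homozygotes have probability p_i^2; heterozygotes have 2 * p_i * p_j.
        probs[(i, j)] = p[i] ** 2 if i == j else 2 * p[i] * p[j]
    return probs

genotype_probs = hardy_weinberg([0.2, 0.3, 0.5])
for genotype, prob in genotype_probs.items():
    print(genotype, round(prob, 4))
```

With three alleles there are six unordered genotypes, and their probabilities sum to 1.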

Distribution of $\mathcal{D}^w$
We first compute allele-sharing dissimilarities between random pairs of individuals sampled from the same population, evaluating the properties of random variables $\mathcal{D}_1^w$ and $\mathcal{D}_2^w$.

$\mathcal{D}_1^w$
$\mathcal{D}_1^w$ is a random variable that takes on values 0, 1/2, and 1. We compute its probability distribution, and we then evaluate its mean and variance.
We obtain the probability for each possible genotype combination in Table 1. These probabilities appear in Table 2, both as sums and as simplified polynomials.
With the probabilities of all genotype combinations obtained, we can sum across genotype combinations to compute probabilities for $\mathcal{D}_1^w(\mathbf{p})$ to equal 0, 1/2, and 1. The resulting probabilities appear in Table 3.

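$\mathbb{E}[\mathcal{D}_1^w]$. The expected value of $\mathcal{D}_1^w(\mathbf{p})$ can be computed from the full probability distribution, via
$$\mathbb{E}[\mathcal{D}_1^w(\mathbf{p})] = \sum_{d \in \{0, \frac{1}{2}, 1\}} d \, \mathbb{P}[\mathcal{D}_1^w(\mathbf{p}) = d].$$
Using the probabilities in Table 3, the result is
$$\mathbb{E}[\mathcal{D}_1^w(\mathbf{p})] = 1 - 2\sum_{i=1}^{I} p_i^2 + 2\sum_{i=1}^{I} p_i^3 - \sum_{i=1}^{I} p_i^4. \tag{3}$$
In the $I = 2$ case, using $p_2 = 1 - p_1$ so that $\sum_{i} p_i^t = p_1^t + (1 - p_1)^t$, Eq. (3) becomes
$$\mathbb{E}[\mathcal{D}_1^w(\mathbf{p})] = 2 p_1 (1 - p_1) \left[ 1 - p_1 (1 - p_1) \right]. \tag{4}$$
Figure 1A plots Eq. (4) as a function of $p_1$. In the figure, we can observe that the mean value of the dissimilarity increases from a value of 0 at $p_1 = 0$, when the population is monomorphic, to a peak of 3/8 at $p_1 = 1/2$. It then decreases symmetrically to 0 at $p_1 = 1$.

Table 2: Probabilities of genotype combinations for pairs of individuals sampled from the same population. For each case, the probability is written as a sum, which is then simplified using Eq. (1). Columns: Genotypes; Probability; Simplified probability.

Table 3: Probability distribution of $\mathcal{D}_1^w(\mathbf{p})$, the allele-sharing dissimilarity $\mathcal{D}_1^w$ for a pair of individuals sampled at random from a population with allele-frequency vector $\mathbf{p}$. The table is obtained by summing entries in Table 2. Columns: Value of the dissimilarity ($d$); Probability.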

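$\mathrm{Var}[\mathcal{D}_1^w]$. To obtain the variance of the distribution of $\mathcal{D}_1^w(\mathbf{p})$, we first calculate
$$\mathbb{E}[(\mathcal{D}_1^w(\mathbf{p}))^2] = 1 - 3\sum_{i=1}^{I} p_i^2 + 3\sum_{i=1}^{I} p_i^3 + \Big(\sum_{i=1}^{I} p_i^2\Big)^2 - 2\sum_{i=1}^{I} p_i^4. \tag{5}$$
The variance can then be obtained from Eqs. (3) and (5) by
$$\mathrm{Var}[\mathcal{D}_1^w(\mathbf{p})] = \mathbb{E}[(\mathcal{D}_1^w(\mathbf{p}))^2] - \left(\mathbb{E}[\mathcal{D}_1^w(\mathbf{p})]\right)^2. \tag{6}$$
For the $I = 2$ case, we once again use that $p_2 = 1 - p_1$:
$$\mathrm{Var}[\mathcal{D}_1^w(\mathbf{p})] = p_1 (1 - p_1) - 4 p_1^2 (1 - p_1)^2 \left[ 1 - p_1 (1 - p_1) \right]^2. \tag{8}$$
Figure 1B plots Eq. (8) as a function of $p_1$. Like the mean, the variance of the dissimilarity increases from 0 at $p_1 = 0$ to a peak at $p_1 = 1/2$, decreasing symmetrically to 0 at $p_1 = 1$. The maximal variance is 7/64.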
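The mean and variance described above can be checked by brute-force enumeration, without relying on any closed form. The sketch below (function names are illustrative assumptions) treats each individual as two independent allele draws from $\mathbf{p}$, sums over all four draws, and compares the enumerated mean with the polynomial $1 - 2\sum p_i^2 + 2\sum p_i^3 - \sum p_i^4$, which is what the enumeration yields when worked out symbolically from the definition of the first ASD variant.

```python
from collections import Counter
from itertools import product

def d1(g, h):
    """D1: one minus half the size of the multiset intersection of two genotypes."""
    return 1 - sum((Counter(g) & Counter(h)).values()) / 2

def within_moments(p, d):
    """Mean and variance of dissimilarity d for two individuals drawn from p."""
    mean = second = 0.0
    for a, b, c, e in product(range(len(p)), repeat=4):
        weight = p[a] * p[b] * p[c] * p[e]  # two independent alleles per individual
        value = d((a, b), (c, e))
        mean += weight * value
        second += weight * value ** 2
    return mean, second - mean ** 2

def closed_form_mean(p):
    """Polynomial in the power sums of p obtained from the enumeration."""
    s2, s3, s4 = (sum(x ** t for x in p) for t in (2, 3, 4))
    return 1 - 2 * s2 + 2 * s3 - s4

mean, var = within_moments([0.5, 0.5], d1)
print(mean, var)  # mean 3/8 and variance 7/64 at p1 = 1/2
```

The enumeration has cost $I^4$, which is fine for checking formulas at small $I$ but is not how one would evaluate many loci in practice.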

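$\mathcal{D}_2^w$
We compute the distribution of random variable $\mathcal{D}_2^w$. This computation uses the same probabilities for genotype pairs as those used for $\mathcal{D}_1^w$ in Table 2.
We compute the probability for each of the possible values of $\mathcal{D}_2^w$ by summing probabilities in Table 2. The resulting probabilities appear in Table 4.
$\mathbb{E}[\mathcal{D}_2^w]$. Summing across the possible values for the dissimilarity, $\mathbb{E}[\mathcal{D}_2^w(\mathbf{p})] = \sum_d d \, \mathbb{P}[\mathcal{D}_2^w(\mathbf{p}) = d]$, yields the result
$$\mathbb{E}[\mathcal{D}_2^w(\mathbf{p})] = 1 - \sum_{i=1}^{I} p_i^2. \tag{9}$$
Note that Eq. (9) gives the "expected heterozygosity," the probability that two draws from the allele-frequency distribution produce distinct alleles.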
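The link to expected heterozygosity can also be seen without tabulating genotype pairs. The short derivation below is a sketch under the stated Hardy-Weinberg assumption; the labels $A_1, A_2$ and $B_1, B_2$ for the alleles of the two sampled individuals are introduced here for illustration.

```latex
\begin{align*}
\mathcal{D}_2 &= \frac{1}{4} \sum_{k=1}^{2} \sum_{\ell=1}^{2} \mathbf{1}\{A_k \neq B_\ell\}, \\
\mathbb{E}[\mathcal{D}_2^w(\mathbf{p})]
  &= \frac{1}{4} \sum_{k=1}^{2} \sum_{\ell=1}^{2} \mathbb{P}(A_k \neq B_\ell)
   = \mathbb{P}(A_1 \neq B_1)
   = 1 - \sum_{i=1}^{I} p_i^2 .
\end{align*}
```

Each of the four cross-individual allele pairs is a pair of independent draws from $\mathbf{p}$, so linearity of expectation gives the result directly. For example, at $\mathbf{p} = (1/2, 1/2)$ the value is $1/2$, larger than the corresponding mean of $3/8$ for the first ASD variant.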
For I = 2, Eq. ( 15) can be observed in Figure 1A, as it can be seen that the curve for for the case of I = 3, and the maximal difference in the figure also occurs when alleles have the same frequency, ( p 1 , p 2 , p 3 ) = For the variances, Figure 1B finds that for for intermediate p 1 , and that the two variances are comparable for p 1 near 0 or 1, with some p 1 values producing Var 2F illustrates a similar result for I = 3.For both I = 2 and ; at extreme allele frequencies, the two variances are comparable, sometimes with Var

Distribution of  b
We now examine allele-sharing dissimilarities between pairs of individuals from different populations.Let p be the allele frequency vector for the population from which the first individual is sampled, and let q be the corresponding vector for the population of the second individual; the special case of q = p follows Section 3. We evaluate the properties of the random variables  b 1 and  b 2 .

Distribution of $\mathcal{D}_1^b$
We obtain the probability for each possible genotype combination for a pair of individuals from different populations. For this computation, we use the polynomials in Eqs. (1) and (2). The resulting probabilities appear in Table 5.
We sum across genotype combinations to obtain probabilities for $\mathcal{D}_1^b$ to equal particular values.

Table 5: Probability of genotype combinations for pairs of individuals sampled from two populations. For each case, the probability is written as a sum, which is then simplified using Eqs. (1) and (2). Columns: Genotypes; Probability; Simplified probability.

Table 6: Probability distribution of $\mathcal{D}_1^b(\mathbf{p}, \mathbf{q})$, the allele-sharing dissimilarity $\mathcal{D}_1^b$ for a pair of individuals sampled at random from two populations with allele-frequency vectors $\mathbf{p}$ and $\mathbf{q}$. The table is obtained by summing entries in Table 5. Columns: Value of the dissimilarity ($d$); Probability.

As we did for the within-population dissimilarity $\mathcal{D}_1^w(\mathbf{p})$, we compute the expected value of the distribution of the between-population dissimilarity $\mathcal{D}_1^b(\mathbf{p}, \mathbf{q})$ as
$$\mathbb{E}[\mathcal{D}_1^b(\mathbf{p}, \mathbf{q})] = \sum_{d \in \{0, \frac{1}{2}, 1\}} d \, \mathbb{P}[\mathcal{D}_1^b(\mathbf{p}, \mathbf{q}) = d].$$
Using the values in Table 6, we obtain
$$\mathbb{E}[\mathcal{D}_1^b(\mathbf{p}, \mathbf{q})] = 1 - 2\sum_{i=1}^{I} p_i q_i + \sum_{i=1}^{I} p_i q_i^2 + \sum_{i=1}^{I} p_i^2 q_i - \sum_{i=1}^{I} p_i^2 q_i^2. \tag{16}$$
For the $I = 2$ case, with $p_2 = 1 - p_1$ and $q_2 = 1 - q_1$, Eq. (16) simplifies to
$$\mathbb{E}[\mathcal{D}_1^b(\mathbf{p}, \mathbf{q})] = p_1 + q_1 - 2 p_1 q_1 - 2 p_1 q_1 (1 - p_1)(1 - q_1). \tag{17}$$
Figure 3A plots Eq. (17). The figure has maxima of 1 at $(p_1, q_1) = (1, 0)$ and $(0, 1)$, when the two populations have the greatest difference in allele frequency, and equals 0 at $(0, 0)$ and $(1, 1)$. It has a saddle surface with a value of 3/8 at saddle point $(p_1, q_1) = (1/2, 1/2)$.
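The values quoted for the biallelic surface can be reproduced by enumeration over allele draws, one individual from each population. In this sketch, `between_mean` and `biallelic_closed_form` are illustrative names; the closed form coded here, $p_1 + q_1 - 2p_1q_1 - 2p_1q_1(1-p_1)(1-q_1)$, is the expression that the enumeration reduces to for $I = 2$ and is included only as a cross-check.

```python
from collections import Counter
from itertools import product

def d1(g, h):
    """D1: one minus half the size of the multiset intersection of two genotypes."""
    return 1 - sum((Counter(g) & Counter(h)).values()) / 2

def between_mean(p, q, d):
    """Expected dissimilarity d between one individual from p and one from q."""
    total = 0.0
    for a, b, c, e in product(range(len(p)), repeat=4):
        # Individual 1 has alleles (a, b) drawn from p; individual 2 has (c, e) from q.
        total += p[a] * p[b] * q[c] * q[e] * d((a, b), (c, e))
    return total

def biallelic_closed_form(p1, q1):
    """Biallelic mean of D1 between populations, as a polynomial in p1 and q1."""
    return p1 + q1 - 2 * p1 * q1 - 2 * p1 * q1 * (1 - p1) * (1 - q1)

for p1, q1 in [(0.5, 0.5), (1.0, 0.0), (0.0, 0.0), (0.3, 0.8)]:
    enumerated = between_mean([p1, 1 - p1], [q1, 1 - q1], d1)
    print((p1, q1), enumerated, biallelic_closed_form(p1, q1))
```

Setting `q` equal to `p` in `between_mean` recovers the within-population mean, which is why the value at $(1/2, 1/2)$ matches the within-population peak of 3/8.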

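Distribution of $\mathcal{D}_2^b$
We use Table 5 to obtain the probabilities of particular values of $\mathcal{D}_2^b$. The resulting probabilities appear in Table 7.
To obtain $\mathbb{E}[\mathcal{D}_2^b]$, we substitute the values from Table 7 into
$$\mathbb{E}[\mathcal{D}_2^b(\mathbf{p}, \mathbf{q})] = \sum_{d} d \, \mathbb{P}[\mathcal{D}_2^b(\mathbf{p}, \mathbf{q}) = d].$$
We obtain
$$\mathbb{E}[\mathcal{D}_2^b(\mathbf{p}, \mathbf{q})] = 1 - \sum_{i=1}^{I} p_i q_i. \tag{22}$$
This quantity is the between-population analogue of expected heterozygosity, the probability that two random draws, one from the allele-frequency distribution of a locus in one population and one from the corresponding distribution in a second population, represent distinct alleles.
For the $I = 2$ case, Eq. (22) simplifies to
$$\mathbb{E}[\mathcal{D}_2^b(\mathbf{p}, \mathbf{q})] = p_1 + q_1 - 2 p_1 q_1. \tag{23}$$

Table 7: Probability distribution of $\mathcal{D}_2^b(\mathbf{p}, \mathbf{q})$, the allele-sharing dissimilarity $\mathcal{D}_2^b$ for a pair of individuals sampled at random from two populations with allele-frequency vectors $\mathbf{p}$ and $\mathbf{q}$. The table is obtained by summing entries in Table 5.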

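Comparison of $\mathcal{D}_1^b$ and $\mathcal{D}_2^b$
The two measures for the between-population dissimilarity have the same expected value, $\mathbb{E}[\mathcal{D}_1^b(\mathbf{p}, \mathbf{q})] = \mathbb{E}[\mathcal{D}_2^b(\mathbf{p}, \mathbf{q})]$, if for all $i$, at least one of $p_i$, $1 - p_i$, $q_i$, and $1 - q_i$ is zero. The condition for equality can be seen from
$$\mathbb{E}[\mathcal{D}_2^b(\mathbf{p}, \mathbf{q})] - \mathbb{E}[\mathcal{D}_1^b(\mathbf{p}, \mathbf{q})] = \sum_{i=1}^{I} p_i q_i (1 - p_i)(1 - q_i).$$
Excluding these equality cases, we have
$$\mathbb{E}[\mathcal{D}_1^b(\mathbf{p}, \mathbf{q})] < \mathbb{E}[\mathcal{D}_2^b(\mathbf{p}, \mathbf{q})]. \tag{28}$$
Note that $\mathcal{D}_1^b \leq \mathcal{D}_2^b$ for all possible genotype combinations in Table 1.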
The inequality in Eq. (28) can be observed for the $I = 2$ case in Figure 3C, where the surface plot of $\mathbb{E}[\mathcal{D}_2^b] - \mathbb{E}[\mathcal{D}_1^b]$ remains greater than or equal to 0, with equality only on the boundary. The largest difference occurs at $(p_1, q_1) = (1/2, 1/2)$ for the case of $I = 2$. For the variances, across most of the parameter space, $\mathrm{Var}[\mathcal{D}_1^b] > \mathrm{Var}[\mathcal{D}_2^b]$. The excess is greatest at points $(p_1, q_1) = (1/3, 2/3)$ and $(2/3, 1/3)$.
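The sign of the difference between the two between-population means can be checked term by term. The following is a sketch of the algebra, assuming the polynomial forms of the two expectations in terms of $p_i$ and $q_i$ (the mean of the second variant is $1 - \sum_i p_i q_i$, and the mean of the first is obtained from the expected size of the multiset intersection).

```latex
\begin{align*}
\mathbb{E}[\mathcal{D}_2^b] - \mathbb{E}[\mathcal{D}_1^b]
  &= \Big(1 - \sum_i p_i q_i\Big)
   - \Big(1 - 2\sum_i p_i q_i + \sum_i p_i q_i^2 + \sum_i p_i^2 q_i - \sum_i p_i^2 q_i^2\Big) \\
  &= \sum_i \big(p_i q_i - p_i q_i^2 - p_i^2 q_i + p_i^2 q_i^2\big) \\
  &= \sum_i p_i q_i (1 - p_i)(1 - q_i) \;\geq\; 0 .
\end{align*}
```

Every summand is a product of four nonnegative factors, so the sum vanishes only if each summand has a zero factor. For $I = 2$ the difference is $2 p_1 q_1 (1 - p_1)(1 - q_1)$, which equals $1/8$ at $(p_1, q_1) = (1/2, 1/2)$, consistent with means of $1/2$ and $3/8$ at that point.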

The relative magnitudes of $\mathbb{E}[\mathcal{D}^w]$ and $\mathbb{E}[\mathcal{D}^b]$
We now examine the relative magnitudes of the expectations $\mathbb{E}[\mathcal{D}^w]$ and $\mathbb{E}[\mathcal{D}^b]$. We determine the conditions under which the expectation of a within-population dissimilarity exceeds that of a between-population dissimilarity.
Proof. We simplify Eq. (29), noting $p_2 = 1 - p_1$ and $q_2 = 1 - q_1$. To find the region where $\mathbb{E}[\mathcal{D}_1^w(\mathbf{p})] > \mathbb{E}[\mathcal{D}_1^b(\mathbf{p}, \mathbf{q})]$, we solve the polynomial inequality of Eq. (33) with $0 \leq p_1 \leq 1$ and $0 \leq q_1 \leq 1$. Solving for $q_1$ in terms of $p_1$, we find that the expression in Eq. (33) is 0 at $q_1 = p_1$ and at $q_1 = g(p_1)$, and for fixed $p_1$, it is positive when $q_1$ lies between the two roots. The unique real root for $g(x) = x$ is at $x = 1/2$, so that $g(p_1) < p_1$ for $p_1 < 1/2$ and $g(p_1) > p_1$ for $p_1 > 1/2$.

□
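The two-root structure used in the proof can be made explicit. The factorization below is a reconstruction worked out here from the biallelic means of the first ASD variant, $2p_1(1-p_1)[1 - p_1(1-p_1)]$ within a population and $p_1 + q_1 - 2p_1q_1 - 2p_1q_1(1-p_1)(1-q_1)$ between populations; the particular way of writing $g$ is an assumption and may differ in form from the original statement.

```latex
\begin{align*}
\mathbb{E}[\mathcal{D}_1^w(\mathbf{p})] - \mathbb{E}[\mathcal{D}_1^b(\mathbf{p}, \mathbf{q})]
  &= (q_1 - p_1)\big[\,2p_1^3 - 4p_1^2 + 4p_1 - 1 - 2p_1(1 - p_1)\,q_1\big], \\
g(p_1) &= \frac{2p_1^3 - 4p_1^2 + 4p_1 - 1}{2p_1(1 - p_1)}, \qquad 0 < p_1 < 1, \\
g(x) - x &= \frac{(2x - 1)(2x^2 - 2x + 1)}{2x(1 - x)} .
\end{align*}
```

As a function of $q_1$, the difference is a quadratic with negative leading coefficient $-2p_1(1-p_1)$, so it is positive strictly between its roots $p_1$ and $g(p_1)$. Because $2x^2 - 2x + 1$ has no real roots, $g(x) = x$ only at $x = 1/2$. As a numerical check, at $(p_1, q_1) = (0.4, 0.3)$ the within-population mean is $0.3648$ and the between-population mean is $0.3592$, and indeed $g(0.4) \approx 0.183 < 0.3 < 0.4$.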
Figure 4A plots the region identified in Theorem 1. That a nonempty region exists indicates that sometimes, allele frequencies for a biallelic locus produce a within-population dissimilarity that exceeds the between-population dissimilarity. Note that because the choice of which allele is labeled 1 and which is labeled 2 is arbitrary, $(p_1, q_1)$ is included in the region if and only if $(1 - p_1, 1 - q_1)$ is also included.
We can calculate the area of the region in the unit square, which represents the probability $\mathbb{P}(\mathbb{E}[\mathcal{D}_1^w] > \mathbb{E}[\mathcal{D}_1^b])$ under the assumption that $p_1$ and $q_1$ are independently and identically distributed with uniform-[0,1] distribution (Eq. (34)). To examine this probability more generally, for each $I$ from 2 to 20, we perform a simulation. In particular, for each $I$, we consider independently and identically distributed vectors $\mathbf{p}$ and $\mathbf{q}$ from the uniform distribution over the simplex $\Delta^{I-1}$ (the Dirichlet-$(1, 1, \ldots, 1)$ distribution, where the vector of 1's has length $I$).
We sample 100,000 replicate pairs $(\mathbf{p}, \mathbf{q})$, and for each pair we evaluate if $\mathbb{E}[\mathcal{D}_1^w(\mathbf{p})] > \mathbb{E}[\mathcal{D}_1^b(\mathbf{p}, \mathbf{q})]$. Figure 5A plots the resulting probability. We can observe that for $I = 2$, the simulated $\mathbb{P}(\mathbb{E}[\mathcal{D}_1^w] > \mathbb{E}[\mathcal{D}_1^b])$ accords with the analytical value in Eq. (34). The probability then decreases with increasing $I$.
Theorem 2. Consider a locus with $I = 2$ distinct alleles. For individuals sampled from two populations with allele frequency vectors $\mathbf{p} = (p_1, 1 - p_1)$ and $\mathbf{q} = (q_1, 1 - q_1)$, $\mathbb{E}[\mathcal{D}_2^w(\mathbf{p})] > \mathbb{E}[\mathcal{D}_2^b(\mathbf{p}, \mathbf{q})]$ if and only if $q_1 < p_1 < 1/2$ or $1/2 < p_1 < q_1$.
Proof. With $p_2 = 1 - p_1$ and $q_2 = 1 - q_1$, Eq. (35) simplifies to $(2 p_1 - 1)(q_1 - p_1) > 0$. Solving this inequality, we arrive at the result. □
Figure 4B plots the region identified in Theorem 2. This region describes the locations in which allele frequencies for a biallelic locus produce a within-population dissimilarity that exceeds the between-population dissimilarity. As is true for $\mathcal{D}_1$, $(p_1, q_1)$ is included in the region if and only if $(1 - p_1, 1 - q_1)$ is also included.
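The area of the region in the unit square, representing $\mathbb{P}(\mathbb{E}[\mathcal{D}_2^w] > \mathbb{E}[\mathcal{D}_2^b])$ under the assumption that $p_1$ and $q_1$ are independently and identically distributed with uniform-[0,1] distribution, is straightforward:
$$\mathbb{P}(\mathbb{E}[\mathcal{D}_2^w] > \mathbb{E}[\mathcal{D}_2^b]) = \frac{1}{4}. \tag{38}$$
We examine $\mathbb{P}(\mathbb{E}[\mathcal{D}_2^w] > \mathbb{E}[\mathcal{D}_2^b])$ for each $I$ from 2 to 20 by simulation. For each $I$, we consider independently and identically distributed vectors $\mathbf{p}$ and $\mathbf{q}$ from the uniform distribution over the simplex $\Delta^{I-1}$ (the Dirichlet-$(1, 1, \ldots, 1)$ distribution), sampling 100,000 replicate pairs $(\mathbf{p}, \mathbf{q})$, and evaluating the fraction of pairs for which $\mathbb{E}[\mathcal{D}_2^w(\mathbf{p})] > \mathbb{E}[\mathcal{D}_2^b(\mathbf{p}, \mathbf{q})]$. Figure 5B plots the resulting probability, illustrating the agreement between the simulated $\mathbb{P}(\mathbb{E}[\mathcal{D}_2^w] > \mathbb{E}[\mathcal{D}_2^b])$ and the analytical value in Eq. (38) for $I = 2$. The probability then decreases as $I$ increases.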
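The simulation described above can be sketched with the standard library alone. This is not the authors' code: the function names, the seed, and the smaller replicate count are choices made here, and the polynomial expressions for the four expectations are the forms obtained by working through the definitions of the two ASD variants. A Dirichlet-$(1, \ldots, 1)$ vector is drawn by normalizing independent unit-rate exponential variates.

```python
import random

def uniform_simplex(num_alleles, rng):
    """Draw from Dirichlet(1, ..., 1) by normalizing unit-rate exponentials."""
    draws = [rng.expovariate(1.0) for _ in range(num_alleles)]
    total = sum(draws)
    return [x / total for x in draws]

def within_d1(p):
    return 1 - 2 * sum(x**2 for x in p) + 2 * sum(x**3 for x in p) - sum(x**4 for x in p)

def between_d1(p, q):
    return 1 - sum(2*x*y - x*y**2 - x**2*y + x**2 * y**2 for x, y in zip(p, q))

def within_d2(p):
    return 1 - sum(x**2 for x in p)  # expected heterozygosity

def between_d2(p, q):
    return 1 - sum(x * y for x, y in zip(p, q))

def prob_within_exceeds_between(num_alleles, replicates, seed=0):
    """Fractions of (p, q) pairs with within-population mean above between-population mean."""
    rng = random.Random(seed)
    hits1 = hits2 = 0
    for _ in range(replicates):
        p = uniform_simplex(num_alleles, rng)
        q = uniform_simplex(num_alleles, rng)
        hits1 += within_d1(p) > between_d1(p, q)
        hits2 += within_d2(p) > between_d2(p, q)
    return hits1 / replicates, hits2 / replicates

if __name__ == "__main__":
    for num_alleles in (2, 3, 5, 10):
        print(num_alleles, prob_within_exceeds_between(num_alleles, 20000))
```

For $I = 2$, the second returned fraction should be close to $1/4$, the area of the region $\{q_1 < p_1 < 1/2\} \cup \{1/2 < p_1 < q_1\}$ in the unit square.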

Comparison of the 𝔼[
The inequality $\mathbb{E}[\mathcal{D}^w] > \mathbb{E}[\mathcal{D}^b]$, where the mean dissimilarity between individuals from the same population exceeds that between individuals from different populations, holds under different scenarios for $\mathcal{D}_1$ and $\mathcal{D}_2$.
Comparing Eqs. (34) and (38), we see that for the case of $I = 2$, the inequality $\mathbb{E}[\mathcal{D}_1^w] > \mathbb{E}[\mathcal{D}_1^b]$ holds over a smaller fraction of the parameter space than the corresponding inequality $\mathbb{E}[\mathcal{D}_2^w] > \mathbb{E}[\mathcal{D}_2^b]$ (Figure 4). Further, if the former inequality holds, then the latter always holds as well.
In Figure 5, we also observe that the probabilities $\mathbb{P}(\mathbb{E}[\mathcal{D}^w] > \mathbb{E}[\mathcal{D}^b])$ are higher for $\mathcal{D}_2$ than for $\mathcal{D}_1$ in simulations with different numbers of alleles. Hence, use of $\mathcal{D}_2$ rather than $\mathcal{D}_1$ produces a greater probability that the within-population genetic dissimilarity exceeds the between-population dissimilarity.

The relative magnitudes of 𝔼[ 𝑤 ] and 𝔼[ b ]
We have seen that both for $\mathcal{D}_1$ and for $\mathcal{D}_2$, it is possible for the expected dissimilarity $\mathbb{E}[\mathcal{D}^w]$ of random pairs of individuals within a population to exceed the expected dissimilarity $\mathbb{E}[\mathcal{D}^b]$ of random pairs between that population and a second population. However, we will see that for a pair of populations, the mean of their two within-population dissimilarities never exceeds their between-population dissimilarity.
For a pair of populations with allele frequency vectors p and q, let [
Proof.We use Eqs.( 3) and ( 16) Rewriting in terms of the vectors p, q, p, and q, we have Equality is reached in the last step if and only if p = q.□
Proof.We rewrite [  2 (p, q)] − [ b 2 (p, q)] using Eqs.( 9) and ( 22): In terms of the vectors p and q, we have with equality if and only if p = q.□
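The algebra behind both proofs can be summarized in two identities. They are written here as a sketch, with the mean of the two within-population expectations spelled out explicitly rather than given a separate symbol, and using the polynomial forms of the expectations in terms of $p_i$ and $q_i$.

```latex
\begin{align*}
\tfrac{1}{2}\big(\mathbb{E}[\mathcal{D}_2^w(\mathbf{p})] + \mathbb{E}[\mathcal{D}_2^w(\mathbf{q})]\big)
  - \mathbb{E}[\mathcal{D}_2^b(\mathbf{p}, \mathbf{q})]
  &= -\frac{1}{2} \sum_{i=1}^{I} (p_i - q_i)^2, \\
\tfrac{1}{2}\big(\mathbb{E}[\mathcal{D}_1^w(\mathbf{p})] + \mathbb{E}[\mathcal{D}_1^w(\mathbf{q})]\big)
  - \mathbb{E}[\mathcal{D}_1^b(\mathbf{p}, \mathbf{q})]
  &= -\frac{1}{2} \sum_{i=1}^{I} (p_i - q_i)^2 \big[\,1 + (1 - p_i - q_i)^2\big].
\end{align*}
```

The second identity uses $p_i^3 + q_i^3 - p_i q_i^2 - p_i^2 q_i = (p_i - q_i)^2 (p_i + q_i)$ and $p_i^4 + q_i^4 - 2 p_i^2 q_i^2 = (p_i - q_i)^2 (p_i + q_i)^2$. Both right-hand sides are nonpositive and vanish only when $\mathbf{p} = \mathbf{q}$, and the bracketed factor, which is at least 1, shows that the deficit for the first ASD variant is at least as large in absolute value as the deficit for the second.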

Comparison of the 𝔼[
The inequality [  (p, q)] ≤ [ b (p, q)], with equality if and only if p = q, holds for both  1 and  2 .Comparing the proofs of Theorems 3 and 4, we see that The extent to which [ 1 (p, q)], has a greater absolute value than the corresponding extent to which

Data
Our theoretical analysis predicts features of dissimilarities $\mathcal{D}_1$ and $\mathcal{D}_2$ in within-population and between-population computations. To compare to empirical observations, we examine multiallelic microsatellite data from the Human Genome Diversity Project (HGDP-CEPH panel). We consider the 1048 individuals and 783 microsatellite loci from Rosenberg et al. (2005), employing the H1048 subset of the HGDP-CEPH panel (Rosenberg 2006). We follow previous uses of the HGDP-CEPH panel in considering 53 populations and 7 geographic regions. We focus on 30 populations for which the number of sampled individuals is greater than 15. Across these 30 populations, the total number of individuals considered is 813.

Theoretical computations
For our theoretical calculations, given a population in the data set and a locus, we compute allele frequencies. We then apply our theoretical formulas to the allele frequency vectors. Note that if a locus is missing genotypes in an individual, then we omit that individual from the calculation of population allele frequencies at the locus, so that we maintain the property that allele frequencies at a locus in a population sum to 1.

Empirical computations
For empirical calculations, we consider the actual diploid individuals in the HGDP-CEPH data, for within-population computations comparing all pairs of individuals within a population. For between-population computations, we compare all pairs of individuals, one each from two populations. Pairwise dissimilarities between diploid genotypes are obtained according to Table 1. We compute within-population and between-population dissimilarities as the means across relevant pairs, and we compute variances of dissimilarity distributions across pairs of individuals.
For this analysis, we omit individuals with missing data prior to computation of empirical ASD values. In between-population comparisons, all allelic types present in one but not the other population are assigned a frequency of 0 in the population in which they are absent.
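The empirical procedure at a single locus can be sketched as follows. The data layout (a list of two-allele tuples per population, with `None` marking a missing genotype), the toy genotypes, and all function names are hypothetical choices for illustration; they do not reflect the format of the HGDP-CEPH files.

```python
from collections import Counter
from itertools import combinations, product

def d1(g, h):
    """D1: one minus half the size of the multiset intersection of two genotypes."""
    return 1 - sum((Counter(g) & Counter(h)).values()) / 2

def d2(g, h):
    """D2: fraction of the four cross-individual allele pairs that are distinct."""
    return sum(a != b for a in g for b in h) / 4

def drop_missing(genotypes):
    """Omit individuals whose genotype at the locus is missing (None)."""
    return [g for g in genotypes if g is not None]

def allele_frequencies(genotypes):
    """Allele frequencies at the locus, computed from non-missing genotypes."""
    counts = Counter(a for g in drop_missing(genotypes) for a in g)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}

def empirical_within(genotypes, d):
    """Mean dissimilarity over all pairs of distinct individuals in one population."""
    pairs = list(combinations(drop_missing(genotypes), 2))
    return sum(d(g, h) for g, h in pairs) / len(pairs)

def empirical_between(pop1, pop2, d):
    """Mean dissimilarity over all pairs with one individual from each population."""
    pairs = list(product(drop_missing(pop1), drop_missing(pop2)))
    return sum(d(g, h) for g, h in pairs) / len(pairs)

# Toy genotypes at one locus (hypothetical data).
pop1 = [("A", "A"), ("A", "B"), ("B", "B"), None]
pop2 = [("A", "A"), ("A", "A")]
print(allele_frequencies(pop1))
print(empirical_within(pop1, d1), empirical_between(pop1, pop2, d1))
```

Because each sampled individual appears in many pairs, these empirical means are not averages over independent draws, which is one reason they need not match values computed from allele frequencies alone.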
We perform the theoretical and empirical calculations for all 783 loci.

Results of data analysis
Figure 6 compares empirical and theoretical means and variances of within-population dissimilarities across pairs of individuals, considering 100 randomly sampled loci in 30 populations. Figure 6A compares the empirical value of $\mathbb{E}[\mathcal{D}_1^w]$ computed by averaging $\mathcal{D}_1^w$ values for all pairs of sampled individuals with the theoretical value predicted from the allele frequencies and Eq. (3). The theoretical calculation generally predicts the empirical dissimilarity, with most points clustering along the diagonal (r = 0.962). In Figure 6B, a similar plot for $\mathbb{E}[\mathcal{D}_2^w]$ using Eq. (9) for the theoretical computation produces closer agreement between the empirical and theoretical values (r = 0.999).
Figure 6C and D compare empirical and theoretical variances across pairs of individuals for withinpopulation dissimilarities, using Eqs.( 6) and ( 12) for the theoretical computation.The theoretical variance predicts the empirical variance, but the agreement is not as close as for the mean (r = 0.676 for Var[  1 ], r = 0.732 for Var[  2 ]). Figure 7 plots analogous comparisons for between-population dissimilarities, considering a subset of loci from Figure 6.In Figure 7A 2 ] and the theoretical value (r = 1.000).
Figure 7C and D consider relationships between empirical and theoretical between-population variances for  1 and  2 .As was observed in Figure 6C and D, empirical and theoretical variance are correlated (r = 0.676 ), but the agreement for variances is not as close as for the mean.Figure 8 empirically examines the inequalities in Theorems 3 and 4 stating that when computed from allele frequencies, the mean of the within-population dissimilarities for two populations is always less than the dissimilarity between them.It shows all population pairs from Figures 6 and 7 with a single random locus.
In Figure 8A, we find that the theoretical values of [ b 1 ] and [  1 ], computed from allele frequencies alone, follow the predicted inequality, with . However, the theorem does not necessarily apply to dissimilarities computed from actual diploid individuals, and indeed, some exceptions are observed in which the empirical  8C).Similar results hold for [ b 2 ] and [  2 ] in Figure 8B and D.
Figure 9 tabulates the fraction of loci for which the empirical within-population dissimilarity of a population (denoted Population 1) exceeds the population's empirical between-population dissimilarity with a second population (Population 2), or $\mathbb{E}[\mathcal{D}^w] > \mathbb{E}[\mathcal{D}^b]$. The populations are arranged geographically, following a general decrease in within-population genetic diversity with migration from Africa, as measured by expected heterozygosity $1 - \sum_{i} p_i^2$ (Prugnolle et al. 2005; Ramachandran et al. 2005). In Figure 9A, for $\mathcal{D}_1$, if Population 1 is a population with relatively low within-population heterozygosity, such as a Native American population, then its within-population dissimilarity rarely exceeds its between-population dissimilarity with a second population (rightmost columns). The fraction of loci for which $\mathbb{E}[\mathcal{D}^w] > \mathbb{E}[\mathcal{D}^b]$ is greatest for intermediate-heterozygosity South Asian populations (central columns). If Population 2 is a high-heterozygosity African population, then for all non-African choices of Population 1, the within-population dissimilarity of Population 1 rarely exceeds the between-population dissimilarity with an African Population 2 (bottom rows). Similar patterns are seen in Figure 9B for $\mathcal{D}_2$, with the additional observation that the within-population dissimilarity of Population 1 often exceeds the between-population dissimilarity when low-heterozygosity Native American populations are placed in the role of Population 2 (top rows).
Empirical values rely on dissimilarity calculations according to Table 1 from pairs of diploid individuals, and theoretical values are calculated from allele frequencies according to Eqs. (16), (19), (22) and (25).

Discussion
Allele-sharing statistics are often used to quantify genetic dissimilarity within and between populations. Because they typically share a larger number of recent ancestors, individuals from the same population might be predicted to possess a lower genetic dissimilarity than those from different populations. We have mathematically explored the circumstances under which this prediction fails, when the genetic dissimilarity within a population exceeds the genetic dissimilarity between two populations. The analysis characterizes the properties of allele frequency vectors that give rise to this counterintuitive scenario, illustrating its occurrence in human population-genetic data.
When does within-population dissimilarity for a population exceed between-population dissimilarity with a second population? The conditions that permit this inequality in the case of $I = 2$ alleles are instructive (Theorems 1 and 2 and Figure 4). In this case, two populations have unbalanced allele frequencies, with Population 2 more unbalanced than Population 1, but the two populations are similar in their frequencies. In Population 1, dissimilarity is generated from comparisons of homozygotes for one allele and homozygotes for the other allele. However, because Population 2 has allele frequencies that are more unbalanced than those of Population 1, fewer comparisons of distinct homozygotes occur in the between-population comparison. This phenomenon results in a within-population dissimilarity in Population 1 that exceeds the between-population dissimilarity.
Beyond $I = 2$, such an excess is observed in empirical calculations with $I \geq 2$ alleles (Figure 9), as well as in simulations, though with decreasing probability as $I$ increases (Figure 5). Although a population can possess greater within-population dissimilarity than its between-population dissimilarity to a second population, we find that for arbitrary numbers of alleles $I$, it is not possible for both populations in a pair to possess greater within-population dissimilarity than the between-population dissimilarity (Theorems 3 and 4). In data, "theoretical" dissimilarities obtained by treating allele frequencies in the data as parametric frequencies of two populations follow this inequality strictly, with greater between-population dissimilarity than at least one of the two within-population dissimilarities (Figure 8A and B). Similarly, the mean of the two within-population dissimilarities is strictly less than the between-population dissimilarity in theoretical calculations (Figure 8A and B); while "empirical" dissimilarities calculated from individual genotypes can violate the inequality, we find that these violations are generally mild (Figure 8C and D).
The results can contribute to understanding unexpected phenomena involving allele-sharing dissimilarities in human populations. We have seen that within-population dissimilarities in Population 1 sometimes exceed between-population dissimilarities, often in comparisons that involve a lower-diversity Population 2 and a higher-diversity Population 1 (Figure 9); in essence, a high-diversity population can possess enough variation that its inter-individual dissimilarity can exceed the dissimilarity between populations. Our theoretical calculations provide a basis for this scenario, and in fact, we saw for $I = 2$ that it is not unlikely in certain parts of the allele frequency space (Figure 4).
The two variants of allele-sharing dissimilarity that we studied,  1 and  2 , share many features.For I = 2 and I = 3 alleles, the expected values of   ) for I ≥ 2 (Figure 5).
In the empirical analysis,  2 has a closer match between empirical and theoretical mean values of the dissimilarity (Figures 6B and 7B).Its patterns in the fraction of loci for which [  ] > [ b ] align more closely with the heterozygosity values of the populations, with the probability of [  ] > [ b ] larger when Population 1 is a higher-diversity population and Population 2 is a lower-diversity population (Figure 9B).Notably, expressions for [ 2 ] are closely tied to heterozygosity (Eq.( 9)) and its between-population analogue (Eq.( 22)), potentially explaining the tighter connection of heterozygosity to its associated observations.Thus, the lesserused  2 -which, unlike  1 , allows the dissimilarity of an individual and itself to be nonzero (Table 1) -does possess a more easily interpreted pattern in the probability that [  ] > [ b ].
Does our analysis suggest a preference for  1 over  2 , or vice versa?To summarize,  1 has been used more frequently than  2 , and it also has the property that the dissimilarity of an individual and itself is zero.The less frequently used  2 does not have this property, but it produces simpler expressions for its withinpopulation and between-population expectations, with more natural interpretations of those expectations and their consequences.We conclude that although  1 has a number of desirable properties,  2 does as well, and it perhaps merits attention commensurate with that given to  1 .
This work has several possible extensions. We have focused on the first and second moments of allele-sharing dissimilarities across pairs of individuals; the full distributions (Tables 3, 4, 6, 7) could also be further investigated. We examined I = 2 in the greatest detail, but special cases that fix a maximal value of I could also be considered. We chose the two most frequently used ASD variants, $\mathcal{D}_1$ and $\mathcal{D}_2$, but a variant designed for genotypes obtained by observation of band patterns (Chakraborty and Jin 1993) could also be studied.
We note significant caveats in interpreting our empirical analysis in relation to our theoretical computations. The empirical computations make use of all pairs of individuals drawn from specified samples; each sampled individual appears in many pairs, so that the empirical analysis does not follow the assumption of the theoretical analysis that pairs represent independent draws from allele frequency distributions. A second difference between the empirical and theoretical analyses is that the theoretical analysis assumes that pairs of alleles within an individual are independent draws from the allele-frequency distribution, whereas inbreeding can induce dependence of these alleles empirically. Such deviations from the theoretical assumptions could be explored in two ways: in simulations that do and do not permit inbreeding and reuse of pairs of individuals, and in empirical samples large enough to avoid such reuses.
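A simulation of the idealized setting, with no inbreeding and no reuse of individuals, might look like the sketch below. It is an illustration under stated assumptions rather than the analysis performed here: $\mathcal{D}_2$ is taken to be the fraction of the four cross-individual allele pairs that contain distinct alleles, the theoretical mean is taken to be $1 - \sum_i p_i q_i$, and all function names are hypothetical.

```python
import random


def sample_genotype(freqs, rng):
    """Draw two alleles independently from the frequency vector (no inbreeding)."""
    alleles = range(len(freqs))
    return tuple(rng.choices(alleles, weights=freqs, k=2))


def d2(genotype_a, genotype_b):
    """Fraction of the four cross-individual allele pairs with distinct alleles."""
    mismatches = sum(a != b for a in genotype_a for b in genotype_b)
    return mismatches / 4.0


def simulated_mean_d2(p, q, num_pairs=20_000, seed=0):
    """Mean D2 over independent pairs; each individual is used in one pair only."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_pairs):
        total += d2(sample_genotype(p, rng), sample_genotype(q, rng))
    return total / num_pairs


def theoretical_mean_d2(p, q):
    # Assumed form of the expectation under independent allele draws.
    return 1.0 - sum(x * y for x, y in zip(p, q))


if __name__ == "__main__":
    p = [0.6, 0.4]  # higher-diversity population (illustrative values)
    q = [0.9, 0.1]  # lower-diversity population (illustrative values)
    print("within p :", simulated_mean_d2(p, p), theoretical_mean_d2(p, p))
    print("between  :", simulated_mean_d2(p, q), theoretical_mean_d2(p, q))
```

Passing the same vector twice gives the within-population case. Extending `sample_genotype` to allow dependence between the two alleles of an individual, or reusing individuals across pairs, would be one way to probe the deviations described above.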
Allele-sharing dissimilarities have long been used in population genetics. The mathematical relationships we have obtained assist both in predicting their properties in relation to allele frequencies and in understanding empirical aspects of their values. When counterintuitive phenomena are obtained with such dissimilarities, such as a greater within-population dissimilarity than the between-population dissimilarity, the mathematical results can potentially provide insight into the unexpected observations.
because among the four possible pairs of alleles (A, A), (A, B), (B, A), and (B, B), where the first entry in the pair represents an allele from the first individual and the second entry is an allele from the second individual, two of the four contain distinct alleles.

Figure 1: Mean and variance of the within-population dissimilarities $\mathcal{D}^w_1$ and $\mathcal{D}^w_2$ for I = 2 alleles as functions of the frequency $p_1$ of one

Figure 2: Mean and variance of the within-population dissimilarities $\mathcal{D}^w_1$ and $\mathcal{D}^w_2$ for I = 3 alleles as functions of the frequencies $p_1$ and $p_2$ of two of the alleles. (A) Mean of $\mathcal{D}^w_1$, Eq. (3). (B) Mean of $\mathcal{D}^w_2$, Eq. (9). (C) …

Figure 3: Mean and variance of the between-population dissimilarities $\mathcal{D}^b_1$ and $\mathcal{D}^b_2$ for I = 2 alleles as functions of the frequencies $(p_1, q_1)$

Figure 5: The probability $\mathbb{P}(\mathbb{E}[\mathcal{D}^w] > \mathbb{E}[\mathcal{D}^b])$ for simulated pairs of allele frequency vectors (p, q) with I distinct alleles. (A) $\mathcal{D}_1$. (B) $\mathcal{D}_2$. Independent and identical uniform distributions are simulated for each I, 2 ≤ I ≤ 20, by drawing uniformly from the simplex $\Delta^{I-1}$ …

… we see a close relationship between empirical $\mathbb{E}[\mathcal{D}^b_1]$ and theoretical $\mathbb{E}[\mathcal{D}^b_1]$ similar to the relationship observed in Figure 6A (r = 0.943). As was seen in Figure 6B, in Figure 7B, we see a stronger relationship between the empirical value of $\mathbb{E}[\mathcal{D}^b \ldots$

Figure 7: Empirical and theoretical mean and variance of between-population allele-sharing dissimilarities. Each panel considers 10 randomly sampled loci in pairs among the 30 populations with sample size greater than 15 (10 ×

Figure 8: Empirical and theoretical $\mathbb{E}[\mathcal{D}^b]$ and $\mathbb{E}[\mathcal{D}^w]$. Each panel considers a random locus, D1S1677, in 435 pairs of populations with sample size greater than 15. The locus is among those used in Figures 6 and 7. The upper left triangle is the region in which the between-population dissimilarity of two populations exceeds the mean of the within-population dissimilarities of the two populations, $\mathbb{E}[\mathcal{D}^b] > \mathbb{E}[\mathcal{D}^w]$, as proven for theoretical dissimilarities (Theorems 3 and 4). The two ends of a horizontal gray line indicate the $\mathbb{E}[\mathcal{D}^w]$ values for two populations whose mean within-population dissimilarity is plotted at the midpoint of the line. (A) Theoretical values of $\mathcal{D}_1$. (B) Theoretical values of $\mathcal{D}_2$. (C) Empirical values of $\mathcal{D}_1$. (D) Empirical values of $\mathcal{D}_2$.

Figure 9: Fraction of loci for which $\mathbb{E}[\mathcal{D}^b] < \mathbb{E}[\mathcal{D}^w]$. Each panel considers all 783 loci in pairs among the 30 populations with sample size greater than 15. Each cell denotes a pair of populations, with Population 1 considered for the within-population dissimilarity. Geographical regions are separated by bold black lines. (A) $\mathcal{D}_1$. (B) $\mathcal{D}_2$.


Table 4: Probability distribution of $\mathcal{D}^w_2(p)$, the allele-sharing dissimilarity $\mathcal{D}^w_2$ for a pair of individuals sampled at random from a population with allele-frequency vector p. The table is obtained by summing entries in Table 2.